Skip to content
This repository has been archived by the owner on Jan 6, 2023. It is now read-only.

Use EtcdStore rather than TCPStore when using etcd_rdzv #34

Closed
wants to merge 2 commits into from

Conversation

kiukchung
Copy link
Contributor

Summary:
Background:

The rdzv interface returns a store as part of the next() API. We used to return a TCPStore since prior to torch 1.4 it was not possible to provide a python implementation of the c10d::Store interface because no trampoline class existed in the pybind definition.

TCPStore had a chicken&egg problem in the unittest context since the constructor on the "master" (rank0 == where the tcp store was hosted) block waits until all workers have joined and the workers need the ip and port of the master. However, finding an unused port and passing it to the TCPStore's constructor and workers has a race condition (which is exacerbated during stress tests). Hence we had to run the tests in serialmode. This is no longer necessary for EtcdStore.

There are two pending github issues that need this change (see attached tasks)

Differential Revision: D19511842

Kiuk Chung added 2 commits January 21, 2020 21:32
Differential Revision: D19507117

fbshipit-source-id: a02c267cc9ffacec7bcb9bac4608e04a255f6540
Summary:
Background:

The rdzv interface returns a store as part of the `next()` API. We used to return a TCPStore since prior to torch 1.4 it was not possible to provide a python implementation of the `c10d::Store` interface because no trampoline class existed in the pybind definition.

TCPStore had a chicken&egg problem in the unittest context since the constructor on the "master" (rank0 == where the tcp store was hosted) block waits until all workers have joined and the workers need the ip and port of the master. However, finding an unused port and passing it to the TCPStore's constructor and workers has a race condition (which is exacerbated during stress tests). Hence we had to run the tests in `serial`mode. This is no longer necessary for `EtcdStore`.

There are two pending github issues that need this change (see attached tasks)

Differential Revision: D19511842

fbshipit-source-id: bccec3c9663a6dbb690c1d7f610f9f546736128d
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D19511842

@kiukchung
Copy link
Contributor Author

@facebook-github-bot
Copy link
Contributor

This pull request has been merged in 393a26c.

@kiukchung kiukchung deleted the export-D19511842 branch September 1, 2020 04:36
fotstrt pushed a commit to eth-easl/elastic that referenced this pull request Feb 17, 2022
Summary:
Pull Request resolved: pytorch#34

Background:

The rdzv interface returns a store as part of the `next()` API. We used to return a TCPStore since prior to torch 1.4 it was not possible to provide a python implementation of the `c10d::Store` interface because no trampoline class existed in the pybind definition.

TCPStore had a chicken&egg problem in the unittest context since the constructor on the "master" (rank0 == where the tcp store was hosted) block waits until all workers have joined and the workers need the ip and port of the master. However, finding an unused port and passing it to the TCPStore's constructor and workers has a race condition (which is exacerbated during stress tests). Hence we had to run the tests in `serial`mode. This is no longer necessary for `EtcdStore`.

There are two pending github issues that need this change (see attached tasks)

Reviewed By: isunjin

Differential Revision: D19511842

fbshipit-source-id: 13fff370668d110fedc436684226e41ad3e69df9
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants